{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--BOOK_INFORMATION-->\n",
    "<a href=\"https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv\" target=\"_blank\"><img align=\"left\" src=\"data/cover.jpg\" style=\"width: 76px; height: 100px; background: white; padding: 1px; border: 1px solid black; margin-right:10px;\"></a>\n",
    "*This notebook contains an excerpt from the book [Machine Learning for OpenCV](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv) by Michael Beyeler.\n",
    "The code is released under the [MIT license](https://opensource.org/licenses/MIT),\n",
    "and is available on [GitHub](https://github.com/mbeyeler/opencv-machine-learning).*\n",
    "\n",
    "*Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations.\n",
    "If you find this content useful, please consider supporting the work by\n",
    "[buying the book](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv)!*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Implementing AdaBoost](10.04-Implementing-AdaBoost.ipynb) | [Contents](../README.md) | [Selecting the Right Model with Hyper-Parameter Tuning](11.00-Selecting-the-Right-Model-with-Hyper-Parameter-Tuning.ipynb) >"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Combining Different Models Into a Voting Classifier\n",
    "\n",
    "So far, we saw how to combine different instances of the same classifier or regressor into an\n",
    "ensemble. In this chapter, we are going to take this idea a step further and combine\n",
    "conceptually different classifiers into what is known as a **voting classifier**.\n",
    "\n",
    "The idea behind voting classifiers is that the individual learners in the ensemble don't\n",
    "necessarily need to be of the same type. After all, no matter how the individual classifiers\n",
    "arrived at their prediction, in the end, we are going to apply a decision rule that integrates\n",
    "all the votes of the individual classifiers. This is also known as a **voting scheme**.\n",
    "\n",
    "Two different voting schemes are common among voting classifiers:\n",
    "- In **hard voting** (also known as **majority voting**), every individual classifier votes for a class, and the majority wins. In statistical terms, the predicted target label of the ensemble is the mode of the distribution of individually predicted labels.\n",
    "- In **soft voting**, every individual classifier provides a probability value that a specific data point belongs to a particular target class. The predictions are weighted by the classifier's importance and summed up. Then the target label with the greatest sum of weighted probabilities wins the vote.\n",
    "\n",
    "You can find an example of how these voting schemes work in practice in the book."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implementing a Voting Classifier\n",
    "\n",
    "Let's look at a simple example of a voting classifier that combines three different algorithms:\n",
    "- A logistic regression classifier from [Chapter 3](03.00-First-Steps-in-Supervised-Learning.ipynb), *First Steps in Supervised Learning*\n",
    "- A Gaussian naive Bayes classifier from [Chapter 7](07.00-Implementing-a-Spam-Filter-with-Bayesian-Learning.ipynb), *Implementing a Spam Filter with Bayesian Learning*\n",
    "- A random forest classifier from this chapter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can combine these three algorithms into a voting classifier and apply it to the breast\n",
    "cancer dataset with the following steps.\n",
    "\n",
    "Load the dataset, and split it into training and test sets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_breast_cancer\n",
    "iris = load_breast_cancer()\n",
    "X = iris.data\n",
    "y = iris.target"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, random_state=13\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instantiate the individual classifiers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "model1 = LogisticRegression(random_state=13)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.naive_bayes import GaussianNB\n",
    "model2 = GaussianNB()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier\n",
    "model3 = RandomForestClassifier(random_state=13)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Assign the individual classifiers to the voting ensemble. Here, we need to pass a\n",
    "list of tuples (`estimators`), where every tuple consists of the name of the\n",
    "classifier (a string of our choosing) and the model object. The voting scheme can\n",
    "be either `voting='hard'` or `voting='soft'`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.ensemble import VotingClassifier\n",
    "vote = VotingClassifier(estimators=[('lr', model1),\n",
    "                                    ('gnb', model2),\n",
    "                                    ('rfc', model3)],\n",
    "                        voting='hard')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fit the ensemble to the training data and score it on the test data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.95104895104895104"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vote.fit(X_train, y_train)\n",
    "vote.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to convince us that 95.1% is a great accuracy score, we can compare the ensemble's\n",
    "performance to the theoretical performance of each individual classifier. We do this by\n",
    "fitting the individual classifiers to the data. Then we will see that the logistic regression\n",
    "model achieves 94.4% accuracy on its own:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.94405594405594406"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model1.fit(X_train, y_train)\n",
    "model1.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, the naive Bayes classifier achieves 93.0% accuracy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.93006993006993011"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model2.fit(X_train, y_train)\n",
    "model2.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Last but not least, the random forest classifier also achieved 94.4% accuracy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.94405594405594406"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model3.fit(X_train, y_train)\n",
    "model3.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All in all, we were just able to gain a good percent in performance by combining three\n",
    "unrelated classifiers into an ensemble. Each of these classifiers might have made different\n",
    "mistakes on the training set, but that's OK because on average, we need just two out of three\n",
    "classifiers to be correct."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Implementing AdaBoost](10.04-Implementing-AdaBoost.ipynb) | [Contents](../README.md) | [Selecting the Right Model with Hyper-Parameter Tuning](11.00-Selecting-the-Right-Model-with-Hyper-Parameter-Tuning.ipynb) >"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
